71 research outputs found
Protein sectors: statistical coupling analysis versus conservation
Statistical coupling analysis (SCA) is a method for analyzing multiple
sequence alignments that was used to identify groups of coevolving residues
termed "sectors". The method applies spectral analysis to a matrix obtained by
combining correlation information with sequence conservation. It has been
asserted that the protein sectors identified by SCA are functionally
significant, with different sectors controlling different biochemical
properties of the protein. Here we reconsider the available experimental data
and note that it involves almost exclusively proteins with a single sector. We
show that in this case sequence conservation is the dominating factor in SCA,
and can alone be used to make statistically equivalent functional predictions.
Therefore, we suggest shifting the experimental focus to proteins for which SCA
identifies several sectors. Correlations in protein alignments, which have been
shown to be informative in a number of independent studies, would then be less
dominated by sequence conservation.Comment: 36 pages, 17 figure
Recommended from our members
Statistical and machine learning approaches to predicting protein-ligand interactions.
Data driven computational approaches to predicting protein-ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein-ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes
Recommended from our members
Comparative analysis of nanobody sequence and structure data.
Nanobodies are a class of antigen-binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity-how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein-binding nanobodies with antigen-bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops
Charge as a Selection Criterion for Translocation through the Nuclear Pore Complex
Nuclear pore complexes (NPCs) are highly selective filters that control the exchange of material between nucleus and cytoplasm. The principles that govern selective filtering by NPCs are not fully understood. Previous studies find that cellular proteins capable of fast translocation through NPCs (transport receptors) are characterized by a high proportion of hydrophobic surface regions. Our analysis finds that transport receptors and their complexes are also highly negatively charged. Moreover, NPC components that constitute the permeability barrier are positively charged. We estimate that electrostatic interactions between a transport receptor and the NPC result in an energy gain of several kBT, which would enable significantly increased translocation rates of transport receptors relative to other cellular proteins. We suggest that negative charge is an essential criterion for selective passage through the NPC.Merck Research LaboratoriesNational Science Foundation (U.S.) (Division of Mathematical Sciences)Kavli Institute for Bionano Science & Technology at Harvard UniversityNational Centers for Systems Biology (U.S.) (NIGMS grant GM068763)National Institute of General Medical Sciences (U.S.
Conservation Weighting Functions Enable Covariance Analyses to Detect Functionally Important Amino Acids
The explosive growth in the number of protein sequences gives rise to the possibility of using the natural variation in sequences of homologous proteins to find residues that control different protein phenotypes. Because in many cases different phenotypes are each controlled by a group of residues, the mutations that separate one version of a phenotype from another will be correlated. Here we incorporate biological knowledge about protein phenotypes and their variability in the sequence alignment of interest into algorithms that detect correlated mutations, improving their ability to detect the residues that control those phenotypes. We demonstrate the power of this approach using simulations and recent experimental data. Applying these principles to the protein families encoded by Dscam and Protocadherin allows us to make testable predictions about the residues that dictate the specificity of molecular interactions
Inferring interaction partners from protein sequences.
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm's performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data.Human Frontier Science Program, National Institutes of Health (Grant ID: R01-GM082938), National Science Foundation (Grant ID: PHY-1305525), Marie Curie (Career Integration Grant ID: 631609), Next Generation Fellowship, Eric and Wendy Schmidt Transformative Technology FundThis is the author accepted manuscript. The final version is available from the Proceedings of the National Academy of Sciences of the United States of America via https://doi.org/10.1073/pnas.160676211
Recommended from our members
Critiquing Protein Family Classification Models Using Sufficient Input Subsets.
In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential
Optimal Design of Experiments by Combining Coarse and Fine Measurements.
In many contexts, it is extremely costly to perform enough high-quality experimental measurements to accurately parametrize a predictive quantitative model. However, it is often much easier to carry out large numbers of experiments that indicate whether each sample is above or below a given threshold. Can many such categorical or "coarse" measurements be combined with a much smaller number of high-resolution or "fine" measurements to yield accurate models? Here, we demonstrate an intuitive strategy, inspired by statistical physics, wherein the coarse measurements are used to identify the salient features of the data, while the fine measurements determine the relative importance of these features. A linear model is inferred from the fine measurements, augmented by a quadratic term that captures the correlation structure of the coarse data. We illustrate our strategy by considering the problems of predicting the antimalarial potency and aqueous solubility of small organic molecules from their 2D molecular structure.. L.βJ.βC. acknowledges a Next Generation fellowship and a Marie Curie CIG [Evo-Couplings, Grant No. 631609]. M.βP.βB. acknowledges support from the Simons Foundation and from the National Science Foundation through DMS-1715477
Proline provides site-specific flexibility for in vivo collagen.
Fibrillar collagens have mechanical and biological roles, providing tissues with both tensile strength and cell binding sites which allow molecular interactions with cell-surface receptors such as integrins. A key question is: how do collagens allow tissue flexibility whilst maintaining well-defined ligand binding sites? Here we show that proline residues in collagen glycine-proline-hydroxyproline (Gly-Pro-Hyp) triplets provide local conformational flexibility, which in turn confers well-defined, low energy molecular compression-extension and bending, by employing two-dimensional 13C-13C correlation NMR spectroscopy on 13C-labelled intact ex vivo bone and in vitro osteoblast extracellular matrix. We also find that the positions of Gly-Pro-Hyp triplets are highly conserved between animal species, and are spatially clustered in the currently-accepted model of molecular ordering in collagen type I fibrils. We propose that the Gly-Pro-Hyp triplets in fibrillar collagens provide fibril "expansion joints" to maintain molecular ordering within the fibril, thereby preserving the structural integrity of ligand binding sites.BBSRC, EPSRC, Raymond and Beverly Sackler Fund for Physics of Medicine, Wellcome Trust, ER
- β¦